software artifact
An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks
Zhou, Xin, Kim, Kisub, Zhang, Ting, Weyssow, Martin, Gomes, Luis F., Yang, Guang, Liu, Kui, Xia, Xin, Lo, David
Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, many automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SE-Jury, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SE-Jury first defines five distinct evaluation strategies, each implemented by an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges as a team to produce a final correctness score through ensembling. We evaluate SE-Jury across a diverse set of software engineering (SE) benchmarks that span three popular SE tasks: code generation, automated program repair, and code summarization. Results demonstrate that SE-Jury consistently achieves a higher correlation with human judgments, with improvements ranging from 29.6% to 140.8% over existing automatic metrics. SE-Jury reaches agreement levels with human annotators that are close to inter-annotator agreement in code generation and program repair. These findings underscore SE-Jury's potential as a scalable and reliable alternative to human evaluation in these SE tasks.
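The ensemble mechanism the abstract describes can be pictured as a small scoring loop: several independent judges each score an artifact, a selection step keeps a subset of them, and the team's scores are combined. The sketch below is a toy illustration of that structure only; the judge names, the averaging combiner, and the outlier-dropping selection rule are assumptions for illustration, not SE-Jury's actual strategies.

```python
# Toy sketch of an ensemble-judge scoring loop in the spirit of SE-Jury.
# All judges and the team-selection heuristic here are illustrative.
from statistics import mean

def ensemble_judge(artifact, judges, select_team):
    """Score `artifact` with every judge, pick a team, ensemble its scores."""
    scores = {name: judge(artifact) for name, judge in judges.items()}
    team = select_team(scores)                  # dynamic team selection
    return mean(scores[name] for name in team)  # simple averaging ensemble

# Toy judges: each maps an artifact to a correctness score in [0, 1].
judges = {
    "exact_match":   lambda a: 1.0 if a == "ref" else 0.0,
    "length_check":  lambda a: min(len(a) / 3, 1.0),
    "keyword_check": lambda a: 1.0 if "re" in a else 0.0,
}

# Toy selection rule: drop the judge whose score deviates most from the mean.
def drop_outlier(scores):
    avg = mean(scores.values())
    outlier = max(scores, key=lambda n: abs(scores[n] - avg))
    return [n for n in scores if n != outlier]

print(ensemble_judge("ref", judges, drop_outlier))
```

The point of the structure is that no single judge's bias dominates: the selection step can discard a judge that disagrees sharply with the rest before the ensemble averages the remainder.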
Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair
Feng, Qiong, Ma, Xiaotian, Sheng, Jiayi, Feng, Ziyuan, Song, Wei, Liang, Peng
LLMs have garnered considerable attention for their potential to streamline Automated Program Repair (APR). LLM-based approaches can either insert the correct code or directly generate patches when provided with buggy methods. However, most LLM-based APR methods rely on a single type of software information, without fully leveraging different software artifacts. Moreover, many LLM-based approaches do not explore which specific types of information best assist in APR. Addressing this gap is crucial for advancing LLM-based APR techniques. We propose DEVLoRe, which uses issue content (description and message) and stack error traces to localize buggy methods, then draws on debug information from the buggy methods, together with the issue content and stack error traces, to localize buggy lines and generate plausible patches that pass all unit tests. The results show that while issue content is particularly effective in assisting LLMs with fault localization and program repair, different types of software artifacts complement each other. By incorporating different artifacts, DEVLoRe successfully locates 49.3% and 47.6% of single and non-single buggy methods and generates 56.0% and 14.5% plausible patches for the Defects4J v2.0 dataset, respectively. This outperforms current state-of-the-art APR methods. The source code and experimental results of this work for replication are available at https://github.com/XYZboom/DEVLoRe.
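The two-stage pipeline the abstract outlines, combining issue content and stack traces to rank suspicious methods, then feeding the extra context into a repair prompt, can be sketched roughly as below. The scoring heuristic, prompt layout, and example data are assumptions for illustration only, not DEVLoRe's implementation.

```python
# Illustrative sketch of a multi-artifact APR pipeline in the spirit of
# DEVLoRe: rank methods by how many artifacts mention them, then build a
# repair prompt carrying all of that context for an LLM to complete.

def localize(methods, issue_text, stack_trace):
    """Rank methods: those named in the stack trace first, then those
    mentioned in the issue text, as a crude suspiciousness heuristic."""
    def score(name):
        s = 0
        if name in stack_trace:
            s += 2  # stack traces are the stronger localization signal
        if name in issue_text:
            s += 1
        return s
    return sorted(methods, key=score, reverse=True)

def build_repair_prompt(method_src, issue_text, stack_trace):
    """Assemble a patch-generation prompt from all available artifacts."""
    return (
        "Fix the bug in the following method.\n"
        f"Issue report:\n{issue_text}\n"
        f"Stack trace:\n{stack_trace}\n"
        f"Buggy method:\n{method_src}\n"
    )

methods = ["parseDate", "formatDate", "toString"]
issue = "parseDate crashes on empty input"
trace = "NullPointerException at DateUtil.parseDate(DateUtil.java:42)"
ranked = localize(methods, issue, trace)
print(ranked[0])  # parseDate ranks first: it appears in both artifacts
```

This mirrors the paper's finding in miniature: each artifact alone gives a weaker signal, but a method implicated by both the issue text and the stack trace is a much stronger repair candidate.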
- Asia > China > Jiangsu Province > Nanjing (0.05)
- Asia > China > Hubei Province > Wuhan (0.04)
- Asia > Middle East > Republic of Türkiye (0.04)
GitLab Acquires UnReview to Further AI Ambitions - DevOps.com
GitLab announced this week it has acquired UnReview, a provider of a tool that employs machine learning algorithms to identify which expert code reviewers to assign to a project based on both the quality of their previous efforts and their current workloads. David DeSanto, senior director for product management at GitLab, said the acquisition of UnReview is the latest step in an AI strategy that, in addition to optimizing DevOps processes, will eventually unify machine learning operations (MLOps) and DevOps workflows. Accessed via the Dev section of the GitLab platform, UnReview will also be employed to manage the overall code review process. DeSanto said GitLab is committed to employing AI technologies to automate workflows and compress cycle times across all stages of the DevSecOps life cycle. The goal is not to eliminate the need for DevOps teams but rather to eliminate the low-level tasks that conspire to hamper productivity, while at the same time improving application security, noted DeSanto.
On Cycling Risk and Discomfort: Urban Safety Mapping and Bike Route Recommendations
Castells-Graells, David, Salahub, Christopher, Pournaras, Evangelos
Bike usage in Smart Cities is becoming paramount for sustainable urban development. Cycling provides tremendous opportunities for a healthier lifestyle, lower energy consumption and carbon emissions, as well as a reduction in traffic jams. While the number of cyclists increases along with the expansion of bike sharing initiatives and infrastructures, the number of bike accidents rises drastically, threatening to jeopardize the urban bike movement. This paper studies cycling risk and discomfort using a diverse spectrum of data sources about geolocated bike accidents and their severity. Empirical continuous spatial risk estimations are calculated via kernel density contours that map safety in a case study of the city of Zurich. The roles of weather, time, accident type, and severity are illustrated. Given the predominance of self-caused accidents, an open-source software artifact for personalized route recommendations is introduced. The software is also used to collect open baseline route data that are compared with alternative routes that minimize risk or discomfort. These contributions can provide invaluable insights for urban planners to improve infrastructure. They can also improve the risk awareness of existing cyclists as well as support new cyclists, such as tourists, in safely exploring a new urban environment by bike.
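The core technique here, an empirical spatial risk surface built by kernel density estimation over geolocated accidents, can be sketched in a few lines. The coordinates, severity weights, and bandwidth below are synthetic assumptions for illustration; the paper's actual estimator and data differ.

```python
# Minimal sketch of severity-weighted 2D kernel density estimation over
# geolocated accident points, the kind of surface from which risk contours
# can be drawn. Pure-Python Gaussian kernels; all data below is synthetic.
import math

def gaussian_kde_2d(points, weights, bandwidth=0.005):
    """Return a function estimating weighted accident density at (x, y)."""
    total = sum(weights)
    def density(x, y):
        acc = 0.0
        for (px, py), w in zip(points, weights):
            d2 = ((x - px) ** 2 + (y - py) ** 2) / (2 * bandwidth ** 2)
            acc += w * math.exp(-d2)  # Gaussian kernel, severity-weighted
        return acc / (total * 2 * math.pi * bandwidth ** 2)
    return density

# Synthetic accidents near Zurich's centre, weighted by severity (1-3).
accidents = [(8.540, 47.370), (8.541, 47.371), (8.555, 47.360)]
severity = [3, 2, 1]
risk = gaussian_kde_2d(accidents, severity)

# The risk surface peaks near the cluster of severe accidents.
print(risk(8.5405, 47.3705) > risk(8.560, 47.380))  # True
```

Evaluating such a density on a regular grid yields the continuous risk surface from which contour lines (as in the paper's safety maps) can be drawn, and which a route recommender can penalize when scoring candidate paths.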
- North America > United States (0.46)
- Europe > Switzerland > Zürich > Zürich (0.37)
- Europe > Austria > Vienna (0.14)
- Europe > Norway (0.04)
- Energy (0.74)
- Transportation > Infrastructure & Services (0.68)
- Health & Medicine > Consumer Health (0.48)
- Government > Regional Government (0.46)
Building the Universal Archive of Source Code
Software is becoming the fabric that binds our personal and social lives, embodying a vast part of the technological knowledge that powers our industry and fuels innovation. Software is a pillar of most scientific research activities in all fields, from mathematics to physics, from chemistry to biology, from finance to social sciences. Software is also an essential mediator for accessing any digital information. In short, a rapidly increasing part of our collective knowledge is embodied in, or dependent on, software artifacts. Our ability to design, use, understand, adapt, and evolve systems and devices on which our lives have come to depend relies on our ability to understand, adapt, and evolve the source code of the software that controls them.